Chinese Word Segmentation based on Maximum Matching and Word Binding Force

Authors

  • Pak-Kwong Wong
  • Chorkin Chan
Abstract

Pokfulam Road, Hong Kong; pkwong@cs.hku.hk

A Chinese word segmentation algorithm based on forward maximum matching and word binding force is proposed in this paper. This algorithm plays a key role in post-processing the output of a character or speech recognizer in determining the proper word sequence corresponding to an input line of character images or a speech waveform. To support this algorithm, a text corpus of over 63 million characters is employed to enrich an 80,000-word lexicon in terms of its word entries and word binding forces. As it stands now, given an input line of text, the word segmentor can process on average 210,000 characters per second when running on an IBM RISC System/6000 3BT workstation, with a correct word identification rate of 99.74%.

1 Introduction

A language model as a post-processor is essential to a recognizer of speech or characters in order to determine the appropriate word sequence and hence the semantics of an input line of text or utterance. It is well known that an N-gram statistics language model is just as effective as, but much more efficient than, a syntactic/semantic analyser in determining the correct word sequence. A necessary condition for the successful collection of N-gram statistics is the existence of a comprehensive lexicon and a large text corpus. The latter must be lexically analysed in order to identify all the words, from which N-gram statistics can be derived.

About 5,000 characters are being used in modern Chinese and they are the building blocks of all words. Almost every character is a word and most words are one or two characters long, but there are also abundant words longer than two characters. Before it is segmented into words, a line of text is just a sequence of characters and there are numerous word segmentation alternatives. Usually, all but one of these alternatives are syntactically and/or semantically incorrect. This is the case because, unlike texts in English, Chinese texts have no word markers. A first step towards building a language model based on N-gram statistics is to develop an efficient lexical analyser to identify all the words in the corpus.

Word segmentation algorithms belong to one of two types in general, viz., the structural (Wang et al. …) and the statistical, respectively. A structural algorithm resolves segmentation ambiguities by examining the structural relationships between words, while a statistical algorithm compares …
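The forward maximum matching step named in the abstract can be illustrated with a minimal Python sketch. This is a generic textbook version of the technique, not the authors' implementation: the function name `forward_maximum_match`, the toy lexicon, and the single-character fallback are all assumptions made for illustration. The word binding force component is not reproduced, because the excerpt above does not define it.

```python
def forward_maximum_match(text, lexicon, max_word_len=None):
    """Greedy left-to-right segmentation (illustrative sketch only).

    At each position, take the longest substring found in `lexicon`;
    if nothing matches, fall back to a single character.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in lexicon), default=1)
    max_word_len = max(1, max_word_len)  # guard against a zero window

    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        # Try the longest candidate first and shrink until something matches.
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon; the paper's 80,000-word lexicon is not available here.
TOY_LEXICON = {"研究", "研究生", "生命", "的", "起源"}
segmented = forward_maximum_match("研究生命的起源", TOY_LEXICON)
print(segmented)  # ['研究生', '命', '的', '起源']
```

In this toy case the greedy match picks 研究生 ("graduate student") first and so misses the reading 研究 / 生命 / 的 / 起源 ("study the origin of life"). Ambiguities of this kind are why pure maximum matching is usually paired with an additional disambiguation signal.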

Similar resources

A Simple and Effective Closed Test for Chinese Word Segmentation Based on Sequence Labeling

In many Chinese text processing tasks, Chinese word segmentation is a vital and required step. Previous studies have proposed various machine learning methods to address this problem. To achieve high performance, many studies used external resources combined with various machine learning algorithms to help segmentation. The goal of this paper is to construct...

Report to BMM-based Chinese Word Segmentor with Context-based Unknown Word Identifier for the Second International Chinese Word Segmentation Bakeoff

This paper describes a Chinese word segmentor (CWS) based on the backward maximum matching (BMM) technique for the 2nd Chinese Word Segmentation Bakeoff in the Microsoft Research (MSR) closed testing track. Our CWS comprises a context-based Chinese unknown word identifier (UWI). All the context-based knowledge for the UWI is generated fully automatically from the MSR training corpus. According to...

Voting between Dictionary-Based and Subword Tagging Models for Chinese Word Segmentation

This paper describes a Chinese word segmentation system that is based on majority voting among three models: a forward maximum matching model, a conditional random field (CRF) model using maximum subword-based tagging, and a CRF model using minimum subword-based tagging. In addition, it contains a post-processing component to deal with inconsistencies. Testing on the closed track of CityU, MSRA ...

Chinese Word Boundaries Detection Based on Maximum Entropy Model

Among natural-language texts, Chinese texts are written continuously with ideographic characters. Unlike Western-language texts such as English or Portuguese, they use no delimiters to specify word boundaries. Hence, for any Chinese information processing system such as automatic question answering, web information retrieval, text-to-speech conversion, mach...

Effective Subsequence-based Tagging for Chinese Word Segmentation

Hai Zhao, Chunyu Kit (Department of Chinese, Translation and Linguistics, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong SAR, China). Abstract: Research on automatic Chinese word segmentation has been advancing rapidly in recent years, especially since the First International Chinese Word Segmentation Bakeo...

Publication date: 1996